Residual networks.
Here we have a simple schematic of a block in a vanilla neural network. There is an input vector X: it gets passed through a layer where it is multiplied with a weight matrix, it goes through an activation layer, perhaps it goes through another weight layer, and then it gives a final output G of X.
This can be reformulated as a block t in a larger neural network, where the weights theta_t and the input x_t give an output x_{t+1}, with x_{t+1} = G(x_t, theta_t).
Here you're putting a value in and you're getting a value out.
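As a rough code sketch of that plain block (a minimal PyTorch version; the layer width, the ReLU activation, and the name PlainBlock are illustrative assumptions, not anything from the lecture):

```python
# A minimal sketch of the plain block x_{t+1} = G(x_t, theta_t):
# weight layer -> activation -> weight layer, value in, value out.
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)   # first weight matrix
        self.layer2 = nn.Linear(dim, dim)   # optional second weight layer
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer2(self.act(self.layer1(x)))  # G(x, theta)

x_t = torch.randn(8, 64)        # a batch of input vectors
x_next = PlainBlock()(x_t)      # x_{t+1}
```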
This is a residual block.
Here it's a little bit different.
There is an input vector X,
it gets passed to a layer where it's multiplied with a weight matrix, it goes through an activation layer, and then perhaps through another weight layer. Finally, the input X is added to the final output to give G of X plus X.
So reformulating in the same t plus one and t terms, you get x_{t+1} = G(x_t, theta_t) + x_t.
So similar to before, you're just putting a value in and
you're getting a value out.
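A matching sketch of the residual version, under the same assumptions as before; the only change is that the input is added back at the end:

```python
# A minimal sketch of the residual block: the same layers as the plain block,
# but the input x is added back, giving x_{t+1} = G(x_t, theta_t) + x_t.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.layer1 = nn.Linear(dim, dim)
        self.layer2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer2(self.act(self.layer1(x))) + x  # skip connection

x_t = torch.randn(8, 64)
x_next = ResidualBlock()(x_t)   # value in, value out, plus the skip
```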
But there is a reason why ResNets were way more successful when they came out.
Firstly, skip connections help information flow through the network by sending the hidden state x forward along with the transformation through the layer, G of x, to get x_{t+1}, preventing important information from being lost. This helped stabilize training, because at the beginning only the skip connections were carrying information while the weight layers were still being optimized.
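One way to see this effect numerically is a toy comparison of how much signal reaches the earliest layer during backpropagation (discussed next), with and without skip connections. The depth, width, tanh activation, and random inputs below are all assumptions chosen only to make the effect visible:

```python
# Toy comparison: gradient reaching the first layer of a deep stack,
# with and without skip connections. Depth, width, and tanh are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 30, 32
layers = [nn.Linear(dim, dim) for _ in range(depth)]

def first_layer_grad_norm(use_skip: bool) -> float:
    x = torch.randn(4, dim)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = out + h if use_skip else out   # residual vs plain
    h.sum().backward()
    norm = layers[0].weight.grad.norm().item()
    for layer in layers:                   # clear grads before the next run
        layer.zero_grad()
    return norm

print("plain:    ", first_layer_grad_norm(use_skip=False))  # typically tiny
print("with skip:", first_layer_grad_norm(use_skip=True))   # typically much larger
```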
The residual block allows for stacking, which helps in forming very, very deep networks. And this is because of the way backpropagation works.
To calculate how the loss function depends on the weights, dL/d theta, we repeatedly apply the chain rule to our intermediate gradients, multiplying them along the way. These multiplications lead to vanishing or exploding gradients, which simply means gradients that approach zero or infinity.
Gradient descent relies on these gradients to move towards a minimum, so a zero or an infinite gradient is really not going to help it.
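As a tiny numerical illustration of that multiplication problem (the factors 0.9 and 1.1 and the depth of 50 are made-up stand-ins for the magnitudes of intermediate gradients):

```python
# Repeatedly multiplying chain-rule factors: values below 1 shrink towards zero,
# values above 1 blow up, which is exactly the vanishing/exploding gradient problem.
import numpy as np

depth = 50
vanishing = np.prod(np.full(depth, 0.9))   # 0.9 ** 50 ~= 5e-3, heading to zero
exploding = np.prod(np.full(depth, 1.1))   # 1.1 ** 50 ~= 1e+2, blowing up

print(f"50 factors of 0.9 multiply to {vanishing:.2e}")
print(f"50 factors of 1.1 multiply to {exploding:.2e}")
```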
So circling back to what happens in a ResNet: you pass x_0 and theta_0 to G and add x_0 to get x_1, and then you go through that iteration for however many residual blocks you have.
In the end, you get the final output x_T, which unrolls to the input x_0 plus the sum of all the residual terms G(x_t, theta_t) along the way.
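A short sketch that unrolls this recurrence and checks the sum (using plain Linear layers in place of each G, and a depth of 5, as assumptions for illustration):

```python
# Unrolling x_{t+1} = G(x_t, theta_t) + x_t: the final state equals the input
# plus the sum of every residual term computed along the way.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, num_blocks = 16, 5
blocks = [nn.Linear(dim, dim) for _ in range(num_blocks)]  # each stands in for G(., theta_t)

x = torch.randn(dim)
x0, residuals = x.clone(), []
for g in blocks:
    r = g(x)          # G(x_t, theta_t)
    residuals.append(r)
    x = r + x         # x_{t+1}

# x_T == x_0 + sum of residual terms
print(torch.allclose(x, x0 + torch.stack(residuals).sum(dim=0), atol=1e-5))  # True
```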